New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LWIP TCP socket close - disconnecting fix #10476
Conversation
@tymoteuszblochmobica, thank you for your changes. |
@mikaleppanen @kjbracey-arm Any opinions about this? For me, this sounds like a dirty workaround. We are going to do a busy loop, where non-blockiness is expected. This surely fix the issue, but I wonder if there is any more elegant solution? |
features/lwipstack/LWIPStack.cpp
Outdated
do { | ||
wait_ms(5); | ||
check_limit--; | ||
} while (tcp != NULL && tcp->state < TIME_WAIT && check_limit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is potentially racey - the PCB that tcp
is pointing at could be deallocated/reused while you're waiting, because you called delete.
It seems that lwIP does have netconn_shutdown
- use that prior to the wait, and then keep monitoring s->conn->pcb.tcp
. (I'm guessing that can become null due to the shutdown):
Something like:
if (NETCONNTYPE_GROUP(s->conn->type) == NETCONN_TCP) {
netconn_shutdown(s->conn, true, true);
while (s->conn->pcb.tcp && s->conn->pcb.tcp->state < TIME_WAIT && check_limit) {
wait();
}
}
Still got a minor/theoretical thread-safety issue on polling s->conn->pcb.tcp
though - not sure about how we lock against the tcpip thread. Is there a callback we could get to know when the state changes?
Alternatively, have you considered configuring SO_LINGER in lwIP? Even if we don't expose the option via NSAPI's setsockopt, you could probably get lwIP to give this wait behaviour by setting it based on your config option.
I think you'd have to poke the option flag and linger timeout value directly in, imitating the code in lwip_getsockopt_impl
, and then mark the socket blocking before the delete to activate it.
@SeppoTakalo - feels a bit icky to me too, but probably not that bad in practice. Don't forget that lwip's connect is also blocking when it shouldn't be... And many off-board stacks probably do stall for a while in their "close" operations. This is at least controllable via an option, and doesn't preclude adding more direct control later like public SO_LINGER setsockopt API. |
@tymoteuszblochmobica Please update |
I checked that if (NETCONNTYPE_GROUP(s->conn->type) == NETCONN_TCP) { doesn't work due to : netconn_shutdown also calls lwip_netconn_do_close_internal which sets conn->pcb.tcp to NULL So it doesn't provide useful information about PCB tcp FSM and waiting loop exits immediately. Unfortunately there is no callback from tcp FSM state change. Making socket blocking and enabling LINGER option also cannot work without LWIP code modification. I this case there is no unsent and unacked data and tcp_shutdown sends FIN to server and immediately returns. Maybe my solution proposal is ugly but i wanted to keep LWIP code unchanged. |
Okay, that shutdown implementation is unfortunate. As the comment in "netconn_close" says "shutting down both ends is the same as closing". :( Why??? You could just shutdown write and then poll - that is what actually is important in terms of the TCP protocol and state machine - it sends the FIN. Shutting down read is an assertion that you don't expect any more data (just a FIN) and makes it send RST if they do (to indicate what they sent has been ignored). That's less critical - it doesn't directly affect the state machine, except in that "drop with reset" case where more data arrives. If data does come in while you're polling, because you only shutdown write, I believe it should still send a RST when you close without having read it. (rst_on_unacked_data in tcp_close_shutdown). |
c7c5bc3
to
9bb4526
Compare
If only shutdown write is selected netconn_shutdown(s->conn, false, true) works now as we expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad that technique works. Should be fine. Bit more refinement.
features/lwipstack/mbed_lib.json
Outdated
@@ -100,6 +100,10 @@ | |||
"help": "Maximum number of retransmissions of SYN segments. Current default (used if null here) is set to 6 in opt.h", | |||
"value": null | |||
}, | |||
"tcp-close-timeout": { | |||
"help": "Maximum number of 5ms intervals for TCP close handshaking timeout", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the config be milliseconds - exposing 5ms interval multiple is a bit weird. Even if you have 5ms granularity, you can subtract 5 from check_limit
the loop (but make it signed).
Or I'd be quite happy with 1ms polling anyway; I think being more responsive to close - shutting down 4ms faster - is probably more of a power saving than only checking every 5 milliseconds.
features/lwipstack/LWIPStack.cpp
Outdated
if (NETCONNTYPE_GROUP(s->conn->type) == NETCONN_TCP && s->conn->pcb.tcp->state == ESTABLISHED) { | ||
netconn_shutdown(s->conn, false, true); | ||
while (s->conn->pcb.tcp && s->conn->pcb.tcp->state < TIME_WAIT && check_limit) { | ||
wait_ms(5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ThisThread::sleep_for(5)
is better - wait_ms
is on the way to deprecation.
Although, I wonder if you now get an event from lwIP when the state changes, so you can do an EventFlags::wait(TCP_CLOSE_TIMEOUT)
rather than a loop. Probably no biggie.
9bb4526
to
c1230bd
Compare
Replaced wait_ms ThisThread::sleep_for(1 ms) Tried to use callback for EventFlags eg. my_event_member::set(TCP_FLAG) and use my_event_member.wait_any(TCP_FLAG,timeout) instead of waiting loop. Calling sequence: recv_tcp calls It exits my_event_member.wait_any(TCP_FLAG only on timeout and LWIP::socket_callback is blocked (but its called from tcpip_thread context ). |
Not quite following your comment. Sounds like you attempted what I suggested - setting an event flag in socket_callback. My hope was that the callback would be called on the state transition you cared about so you could do something like That should work, but only if lwIP is actually sending an event at the right time. Maybe it isn't. If not, not worth messing with it. |
1806702
to
7fb87bd
Compare
If TCP FSM is in ESTABLISHED state, waits for TCP close handshaking until TIME_WAIT The purpose is to prevent eth/wifi driver stop and FIN ACK corrupt. This may happend if network interface disconnect follows immediately after socket_close.
7fb87bd
to
51610cc
Compare
After code cleanup and applying EventFlag again it started working as expected. |
Kevin could You check is it OK? |
CI started |
Test run: SUCCESSSummary: 11 of 11 test jobs passed |
Description
Currently LWIP TCP socket close doesn't care TCP close handshaking is properly completed.
If network interface disconnect follows immediately after TCP socket_close and TCP FSM is in
ESTABLISHED state this causes too early eth/wifi driver stop.
In result FIN ACK sequence is being corrupted and TCP FSM stucks in FIN_WAIT_1 state until LWIP netif is deleted. This is a reason of failing TCP tests due to lack of memory because TCP PCB mempool is not freed on time.
This fix check if socket is TCP and FSM is in ESTABLISHED state.
Then give extra time for connection close handshaking until TIME_WAIT state or timeout occurs.
Pull request type
Reviewers
@SeppoTakalo
@mikaleppanen
@kjbracey-arm
Release Notes